This dataset covers cable and broadcast news coverage from January 21, 2020 to June 12, 2020. The cable networks included are CNN, MSNBC, and Fox News; the broadcast networks are ABC, CBS, and NBC. We used Nexis Uni to gather weeknight transcripts for the following programs over this period:
1.	CNN: Anderson Cooper 360 Degrees, The Lead with Jake Tapper, Cuomo Prime Time, Erin Burnett OutFront, CNN Tonight, and The Situation Room
2.	Fox News: The Five, Special Report with Bret Baier, The Story with Martha MacCallum, Tucker Carlson Tonight, Hannity, Ingraham Angle, and Fox News @ Night
3.	MSNBC: MTP Daily, The Beat with Ari Melber, All in with Chris Hayes, The Rachel Maddow Show, The Last Word with Lawrence O’Donnell, 11th Hour with Brian Williams, and Hardball
4.	ABC World News Tonight
5.	CBS Evening News
6.	NBC Nightly News
These programs cover all prime-time nightly news across the cable and broadcast networks.


After gathering the transcripts, our next goal was to filter out content unrelated to COVID-19. To that end, human coders labeled transcripts to create training and test datasets.
Trained coders classified each paragraph into one of the following three categories:
1.	Directly related to COVID-19 (that is, the paragraph included words directly related to COVID-19, including the health, political, economic, and other implications of the disease).
2.	Indirectly related to COVID-19 (that is, the paragraph did not include words that directly identified COVID-19, but the context of the transcripts made it clear that the health, political, economic, or other implications of the disease were being discussed).
3.	Not related to COVID-19.


Two coders manually coded the transcripts for 52 news broadcasts. Intercoder reliability was high (Direct COVID-19: Krippendorff’s alpha = .87; Indirect COVID-19: Krippendorff’s alpha = .85). After establishing reliability, a single coder manually classified the transcripts from an additional 214 broadcasts. These transcripts included one randomly selected transcript per week from two programs on each network and one randomly selected broadcast per month for the remaining programs. This resulted in a total of 44,643 manually labeled paragraphs.
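
As an illustration of the reliability check described above, the following is a minimal sketch of Krippendorff’s alpha for nominal labels from two coders with no missing values. It assumes each category was scored as a binary flag per paragraph; this description does not state how alpha was actually computed, and the function name and sample labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for nominal labels, two coders, no missing values."""
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must label the same units")

    # Coincidence matrix: each unit contributes both ordered pairs of its two labels.
    coincidences = Counter()
    for a, b in zip(coder_a, coder_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1

    # Marginal total per label, and the total number of pairable values.
    totals = Counter()
    for (c, _k), count in coincidences.items():
        totals[c] += count
    n = sum(totals.values())

    # Disagreement counts: observed off-diagonal mass vs. what chance would give.
    observed = sum(count for (c, k), count in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c, k in permutations(totals, 2))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - (n - 1) * observed / expected


# Hypothetical flags (1 = coded as directly related to COVID-19) for four paragraphs.
coder_a = [0, 0, 1, 1]
coder_b = [0, 0, 1, 0]
alpha = krippendorff_alpha_nominal(coder_a, coder_b)
print(round(alpha, 3))  # 0.533
```

Values near 1 indicate near-perfect agreement, and 0 indicates agreement no better than chance.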


The final step was to classify the remaining transcript paragraphs according to their relevance to COVID-19. Here we performed binary classification, collapsing the “Directly related to COVID-19” and “Indirectly related to COVID-19” categories into a single category denoting that the content was related to COVID-19. Predictions were made using a fine-tuned BERT model (other classifiers did not perform as well and are omitted from this description).
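
The collapsing of the three human-coded categories into a binary label can be sketched as follows. The category strings and the 0/1 encoding are the ones used in the two data files; the function name is hypothetical, and this is not necessarily how the original pipeline implemented the step.

```python
# The two COVID-related categories are merged into a single positive class.
COVID_CATEGORIES = {"covid_direct", "covid_indirect"}


def to_binary_label(category):
    """Return 1 for either COVID-related category and 0 for non_covid."""
    if category in COVID_CATEGORIES:
        return 1
    if category == "non_covid":
        return 0
    raise ValueError(f"unexpected category: {category!r}")


labels = ["covid_direct", "non_covid", "covid_indirect"]
binary = [to_binary_label(c) for c in labels]
print(binary)  # [1, 0, 1]
```

Raising on unknown values, rather than defaulting to 0, keeps a misspelled category from silently becoming a negative example.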

The shared dataset consists of two files, which are the culmination of these efforts: 1) covid19-cable-broadcast-labeled.csv and 2) covid19-cable-broadcast-predicted.csv.


Data File 1: covid19-cable-broadcast-labeled.csv

This file contains the human-coded content. The columns in this file are as follows:

1.	network: this column has 6 possible values (abc, cnn, fox, cbs, msnbc, nbc)
2.	program: there are 25 possible values for this column (11thhourwithbrianwilliams, allinwithchrishayes, andersoncooper360, cnntonight, cuomoprimetime, erinburnettoutfront, eveningnews, foxnewsatnight, Hannity, hardball, ingrahamangle, msnbclive, mtpdaily, nightlynews, special, specialreportwithbretbaier, thebeatwitharimelber, thefive, thelastwordwithlawrenceodonnell, theleadwithjaketapper, therachelmaddowshow, thesituationroom, thestorywithmarthamaccallum, tuckercarlsontonight, worldnewstonight) corresponding to the programs listed above.
3.	date: this is the date when the corresponding program aired
4.	speech_turn: each transcript records the flow of coverage for a particular program on a particular date, which generally involves speakers taking turns. This integer identifies the speech turn
5.	paragraph_sequence: sometimes a single turn spans multiple paragraphs. This column gives the paragraph's sequence number within its turn
6.	paragraph: the text of the paragraph
7.	category: this has three possible values (covid_direct, covid_indirect, non_covid) corresponding to the labels provided by human coders. 
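
To illustrate how the speech_turn and paragraph_sequence columns fit together, here is a minimal sketch that reassembles each speech turn from its paragraphs. It assumes the CSV header matches the column names listed above and that both columns parse as integers. The sample rows, including the date format, are made up for illustration, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def rebuild_turns(rows):
    """Join the paragraphs of each speech turn in paragraph_sequence order."""
    turns = defaultdict(list)
    for row in rows:
        # A turn is identified by its transcript (network, program, date) and turn number.
        key = (row["network"], row["program"], row["date"], int(row["speech_turn"]))
        turns[key].append((int(row["paragraph_sequence"]), row["paragraph"]))
    return {
        key: " ".join(text for _seq, text in sorted(parts))
        for key, parts in turns.items()
    }


# Made-up rows standing in for covid19-cable-broadcast-labeled.csv.
sample = io.StringIO(
    "network,program,date,speech_turn,paragraph_sequence,paragraph,category\n"
    "cnn,cnntonight,2020-03-02,1,2,Second paragraph.,covid_direct\n"
    "cnn,cnntonight,2020-03-02,1,1,First paragraph.,covid_direct\n"
    "cnn,cnntonight,2020-03-02,2,1,Another speaker.,non_covid\n"
)
turns = rebuild_turns(csv.DictReader(sample))
print(turns[("cnn", "cnntonight", "2020-03-02", 1)])
```

For the real file, the in-memory sample would be replaced by the file opened with newline="" as the csv module recommends.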


Data File 2: covid19-cable-broadcast-predicted.csv

This file contains data labeled by our best-performing supervised classifier (fine-tuned BERT). The columns in this file are as follows:

1.	network: this column has 6 possible values (abc, cnn, fox, cbs, msnbc, nbc)
2.	program: there are 25 possible values for this column (11thhourwithbrianwilliams, allinwithchrishayes, andersoncooper360, cnntonight, cuomoprimetime, erinburnettoutfront, eveningnews, foxnewsatnight, Hannity, hardball, ingrahamangle, msnbclive, mtpdaily, nightlynews, special, specialreportwithbretbaier, thebeatwitharimelber, thefive, thelastwordwithlawrenceodonnell, theleadwithjaketapper, therachelmaddowshow, thesituationroom, thestorywithmarthamaccallum, tuckercarlsontonight, worldnewstonight) corresponding to the programs listed above.
3.	date: this is the date when the corresponding program aired
4.	speech_turn: each transcript records the flow of coverage for a particular program on a particular date, which generally involves speakers taking turns. This integer identifies the speech turn
5.	paragraph_sequence: sometimes a single turn spans multiple paragraphs. This column gives the paragraph's sequence number within its turn
6.	paragraph: the text of the paragraph
7.	predicted_label: this has two possible values. The value 0 denotes that the paragraph was predicted to be unrelated to COVID-19, and 1 denotes that it was predicted to be related to COVID-19.
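
As one example of using predicted_label, the sketch below computes the share of paragraphs predicted to be COVID-related for each network. It assumes the CSV header matches the column names listed above; the sample rows, including the date format, are made up, and the function name is hypothetical.

```python
import csv
import io
from collections import defaultdict


def covid_share_by_network(rows):
    """Return the fraction of paragraphs with predicted_label == 1 per network."""
    counts = defaultdict(lambda: [0, 0])  # network -> [COVID paragraphs, all paragraphs]
    for row in rows:
        tally = counts[row["network"]]
        tally[0] += int(row["predicted_label"])
        tally[1] += 1
    return {network: covid / total for network, (covid, total) in counts.items()}


# Made-up rows standing in for covid19-cable-broadcast-predicted.csv.
sample = io.StringIO(
    "network,program,date,speech_turn,paragraph_sequence,paragraph,predicted_label\n"
    "cnn,cnntonight,2020-03-02,1,1,Some text.,1\n"
    "cnn,cnntonight,2020-03-02,2,1,Other text.,0\n"
    "abc,worldnewstonight,2020-03-02,1,1,More text.,1\n"
)
shares = covid_share_by_network(csv.DictReader(sample))
print(shares)  # {'cnn': 0.5, 'abc': 1.0}
```

Grouping by program or date instead only requires changing the key used for counts.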

